This report aims to identify the variables that are most likely to predict the quality of white wine. The prediction is based on a dataset of quality ratings for approximately 5,000 white wines. The dataset contains 11 variables measuring chemical properties of the wine. This report also explores the relationships between the chemical properties to determine which properties affect one another.
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
There appear to be a few outliers for fixed acidity on the higher end of the spectrum. The maximum value for fixed acidity is 14.2, which appears to be very high, as the typical range is around 5-8.
Volatile acidity has a distribution similar to that of fixed acidity. The typical range is from 0.25-0.35. This variable also has some higher-value outliers, giving the data a right skew. I am curious to see what quality rating these outliers received.
The bulk of white wines have a citric acid content of approximately 0.2-0.4 g/dm^3. The maximum value for citric acid is 1.66 g/dm^3. I know that citric acid can create a sour taste, so I am curious how this would affect the quality rating, and whether more or less acidic wines are preferred by wine experts. While the outliers are easy to identify with the measurements provided, I wonder how perceptible the effect of the citric acid is to the wine experts.
A lower amount of residual sugar is clearly more common for white wines. However, there are a few outliers with much greater amounts of residual sugar. As this is clearly not typical of white wines, I am curious how a higher amount of sugar would affect the quality rating.
Chlorides is a right-skewed variable, with the bulk of the measurements falling between 0.03 and 0.06. The second histogram shows the variable on a log scale.
Free sulfur dioxide is another variable with a significant right skew, so I log-transformed the data to better view the distribution. On a log scale, the distribution appears normal.
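As a rough illustration of why the log scale helps, the sketch below uses a hypothetical stand-in for a right-skewed variable (evenly spaced quantiles of a log-normal distribution, not the wine data) and compares the gap between the mean and the median before and after a log10 transform. The names `x`, `gap_raw`, and `gap_log` are illustrative only.

```r
# Hypothetical stand-in for a right-skewed variable such as free.sulfur.dioxide:
# evenly spaced quantiles of a log-normal distribution (NOT the wine data).
x <- qlnorm(ppoints(199), meanlog = log(34), sdlog = 0.5)

# A right skew pulls the mean above the median.
gap_raw <- mean(x) - median(x)

# After a log10 transform the values are symmetric, so mean and median agree.
log_x <- log10(x)
gap_log <- mean(log_x) - median(log_x)

print(c(raw = gap_raw, log10 = gap_log))
```

A gap near zero after the transform is consistent with a roughly symmetric shape on the log scale, although it is not a formal test of normality.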
Total sulfur dioxide has a similar distribution to free sulfur dioxide. I am assuming these two variables will have a fairly strong correlation. As the free sulfur dioxide increases, it would make sense that the total sulfur dioxide measure would increase as well.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
The majority of white wines in this sample have a density between 0.99 and 1.0. There is a slight right skew to the variable, as there are some wines with densities above 1.0. These are clearly outside of the typical range as displayed in the above histogram.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
The variable pH appears to be normally distributed. As indicated by the range of values, white wine is an acidic substance, with an average pH of 3.188. It will be interesting to see which pH has the highest quality rating, as well as the relationship between pH and citric acid content.
The variable sulphates has a bimodal appearance, with a tail off to the right of the graph illustrating the right skew in the data.
There is a lot of variance in the histogram for alcohol content, as the alcohol content is measured to one decimal place. However, it does appear that most white wines fall between 9 and 11 percent.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
The most common quality rating for the sample is 6. The ratings appear roughly symmetric around this value, with no significant skew.
There are 4,898 white wines in the dataset, with 11 variables measuring different chemical properties of the wine (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol). The output variable, quality, is based on a scale of 0 to 10, zero being the worst and ten being the best. These quality ratings were determined by at least three wine experts.
Upon researching wine tasting, I found that the main variables that would likely influence the quality rating are pH, alcohol content, and residual sugar. I will be exploring the effects of these variables on the quality rating.
I would also like to explore the effect density has on the quality rating, as the density of wine is reflective of the alcohol and residual sugar content. Additionally, I would like to explore how well the acidic measurements (fixed acidity, volatile acidity, and citric acid) explain the pH of the wine.
I cleaned up the alcohol variable by cutting it into a new variable, alcohol_bucket. This reduced some of the noise in the alcohol histogram by grouping the data into whole-number percentage intervals, rather than values recorded to one decimal place. The purpose of this transformation was to identify the most typical alcohol percentage for white wines.
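A minimal sketch of this kind of bucketing with base R's `cut()`, assuming whole-number breaks from 8 to 15 (inferred from the interval labels in the frequency table later in this report, not confirmed by the source). The first ten values come from the `str()` output above; the final value of 8.0 is added here only to show what happens at the lower boundary.

```r
# First ten alcohol values from the str() output, plus a hypothetical
# value at the dataset minimum (8.0) to show the boundary behaviour.
alcohol <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11, 8.0)

# Default: right-closed intervals (8,9], (9,10], ...; a value of exactly 8
# falls outside the lowest interval and becomes NA.
alcohol_bucket <- cut(alcohol, breaks = 8:15)

# include.lowest = TRUE closes the first interval on the left: [8,9].
alcohol_bucket_closed <- cut(alcohol, breaks = 8:15, include.lowest = TRUE)

print(table(alcohol_bucket, useNA = "ifany"))
print(table(alcohol_bucket_closed, useNA = "ifany"))
```

The bucket counts printed later in this report sum to 4,896 rather than 4,898, and the minimum alcohol value is 8.00. That would be consistent with two wines at exactly 8% falling outside the lowest interval, though this is only an inference from the printed output.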
Fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, and density are all right-skewed. I examined a few of these on a log scale to limit the impact of the skew on the visual.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.02269729 0.289180698
## volatile.acidity -0.02269729 1.00000000 -0.149471811
## citric.acid 0.28918070 -0.14947181 1.000000000
## residual.sugar 0.08902070 0.06428606 0.094211624
## chlorides 0.02308564 0.07051157 0.114364448
## free.sulfur.dioxide -0.04939586 -0.09701194 0.094077221
## total.sulfur.dioxide 0.09106976 0.08926050 0.121130798
## density 0.26533101 0.02711385 0.149502571
## pH -0.42585829 -0.03191537 -0.163748211
## sulphates -0.01714299 -0.03572815 0.062330940
## alcohol -0.12088112 0.06771794 -0.075728730
## quality -0.11366283 -0.19472297 -0.009209091
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.08902070 0.02308564 -0.0493958591
## volatile.acidity 0.06428606 0.07051157 -0.0970119393
## citric.acid 0.09421162 0.11436445 0.0940772210
## residual.sugar 1.00000000 0.08868454 0.2990983537
## chlorides 0.08868454 1.00000000 0.1013923521
## free.sulfur.dioxide 0.29909835 0.10139235 1.0000000000
## total.sulfur.dioxide 0.40143931 0.19891030 0.6155009650
## density 0.83896645 0.25721132 0.2942104109
## pH -0.19413345 -0.09043946 -0.0006177961
## sulphates -0.02666437 0.01676288 0.0592172458
## alcohol -0.45063122 -0.36018871 -0.2501039415
## quality -0.09757683 -0.20993441 0.0081580671
## total.sulfur.dioxide density pH
## fixed.acidity 0.091069756 0.26533101 -0.4258582910
## volatile.acidity 0.089260504 0.02711385 -0.0319153683
## citric.acid 0.121130798 0.14950257 -0.1637482114
## residual.sugar 0.401439311 0.83896645 -0.1941334540
## chlorides 0.198910300 0.25721132 -0.0904394560
## free.sulfur.dioxide 0.615500965 0.29421041 -0.0006177961
## total.sulfur.dioxide 1.000000000 0.52988132 0.0023209718
## density 0.529881324 1.00000000 -0.0935914935
## pH 0.002320972 -0.09359149 1.0000000000
## sulphates 0.134562367 0.07449315 0.1559514973
## alcohol -0.448892102 -0.78013762 0.1214320987
## quality -0.174737218 -0.30712331 0.0994272457
## sulphates alcohol quality
## fixed.acidity -0.01714299 -0.12088112 -0.113662831
## volatile.acidity -0.03572815 0.06771794 -0.194722969
## citric.acid 0.06233094 -0.07572873 -0.009209091
## residual.sugar -0.02666437 -0.45063122 -0.097576829
## chlorides 0.01676288 -0.36018871 -0.209934411
## free.sulfur.dioxide 0.05921725 -0.25010394 0.008158067
## total.sulfur.dioxide 0.13456237 -0.44889210 -0.174737218
## density 0.07449315 -0.78013762 -0.307123313
## pH 0.15595150 0.12143210 0.099427246
## sulphates 1.00000000 -0.01743277 0.053677877
## alcohol -0.01743277 1.00000000 0.435574715
## quality 0.05367788 0.43557472 1.000000000
There are not many strong relationships in the dataset. The strongest correlation is between density and residual sugar, with a Pearson correlation coefficient of 0.83896645. Of all the variables, alcohol has the strongest correlation with quality (0.43557472).
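The reading of the matrix can be checked programmatically. The sketch below copies a four-variable subset of the coefficients printed above into a small matrix (the subset and the helper names `off_diag`, `strongest_pair`, and `best_for_quality` are choices made here for illustration) and looks up the strongest off-diagonal pair and the variable most correlated with quality.

```r
# Subset of the Pearson coefficients printed in the correlation matrix above.
vars <- c("residual.sugar", "density", "alcohol", "quality")
r <- matrix(
  c( 1.00000000,  0.83896645, -0.45063122, -0.09757683,
     0.83896645,  1.00000000, -0.78013762, -0.30712331,
    -0.45063122, -0.78013762,  1.00000000,  0.43557472,
    -0.09757683, -0.30712331,  0.43557472,  1.00000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Ignore the diagonal (each variable correlates perfectly with itself).
off_diag <- r
diag(off_diag) <- NA

# Position of the largest absolute coefficient.
idx <- which(abs(off_diag) == max(abs(off_diag), na.rm = TRUE), arr.ind = TRUE)[1, ]
strongest_pair <- c(rownames(r)[idx["row"]], colnames(r)[idx["col"]])

# Variable with the largest absolute correlation with quality.
quality_r <- r["quality", setdiff(vars, "quality")]
best_for_quality <- names(which.max(abs(quality_r)))

print(strongest_pair)
print(best_for_quality)
```

On this subset the strongest pair is density and residual sugar, and alcohol is the variable most correlated with quality, which matches the full matrix. With the full dataset loaded, the same lookup could be applied to the output of `cor()` directly.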
In this particular scatterplot, it is difficult to identify whether or not there is a relationship between alcohol content and quality rating. To better examine this relationship, we can look at the average quality rating by alcohol content.
Here, we can identify a general upward trend. It appears that as the alcohol content increases, the average quality rating increases as well. However, there seems to be a lot of variation among the wines with mid-range alcohol content. To explore this issue, the alcohol variable can be cut so that the values are grouped into whole-number intervals.
##
## (8,9] (9,10] (10,11] (11,12] (12,13] (13,14] (14,15]
## 500 1583 1252 850 609 100 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
It does appear that a greater number of wines in the dataset fall within this region of high variation. To eliminate some of the noise, the relationship between alcohol and quality can be plotted using the cut version of the alcohol variable.
In this plot, it is much easier to observe the upward trend in the data. In fact, there is only one region where the quality decreases as the alcohol content increases (between the first and second buckets).
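A minimal sketch of the averaging behind a plot like this, using a hypothetical ten-wine sample (not the wine data) whose values were chosen to mimic the dip between the first two buckets. The data frame `wines` and the result `avg_quality` are illustrative names.

```r
# Hypothetical mini-sample (NOT the wine data): alcohol in % and quality rating.
wines <- data.frame(
  alcohol = c(8.5, 8.9, 9.4, 9.8, 10.3, 10.9, 11.5, 11.8, 12.4, 12.9),
  quality = c(6,   5,   5,   5,   6,    6,    6,    7,    7,    7)
)

# Group alcohol into whole-number intervals, keeping a value of exactly 8.
wines$alcohol_bucket <- cut(wines$alcohol, breaks = 8:15, include.lowest = TRUE)

# Mean quality within each bucket; drop = TRUE skips buckets with no wines.
avg_quality <- sapply(split(wines$quality, wines$alcohol_bucket, drop = TRUE), mean)

print(avg_quality)
```

Averaging within buckets trades resolution for stability: each mean rests on more wines than a mean per distinct alcohol value, so the trend is easier to read.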
During some research on wine quality, I discovered that acidity is a key characteristic when tasting wine. I decided to take a look at the relationship between citric acid content and quality rating. I was unable to discover a significant relationship, although there are two very low points in the plot, which I find interesting. Perhaps pH provides a stronger relationship.
There seems to be a greater variance on the higher and lower ends of the pH spectrum in the plot against average quality. Toward the middle range of pH on the plot, the average quality rating stays around 6.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
The variable pH is recorded to two decimal places, which may explain some of the noise present in the graph. It may be beneficial to cut this variable as was done with the alcohol variable.
## (2.7,2.9] (2.9,3.1] (3.1,3.3] (3.3,3.5] (3.5,3.7] (3.7,3.9]
## 101 1348 2442 866 125 16
Although there is not a clear linear relationship between the pH and quality of the wine, it is evident that on average, white wines with a pH within the range 3.3 to 3.5 have the highest quality rating.
I find it interesting that citric acid and pH do not correlate as strongly as I had expected. After doing some research, I found that “the most prevalent acids found in wine are tartaric acid, malic acid, and citric acid” (winefolly.com), and that these acids combined tend to influence the pH. Instead, volatile or fixed acidity may be better indicators of pH.
First, I examined the relationship between fixed and volatile acidity to see if there was any correlation between the two variables. As discovered in the correlation matrix, the two variables have a correlation coefficient of -0.02269729, demonstrating no real evidence of a linear relationship. The scatterplot of the two variables emphasizes that there is no linear relationship, as the data are clustered in the lower left portion of the graph.
I then looked at the relationship of each variable individually with quality, and was unable to identify a clear relationship with quality for either.
Although the measures of acidity did not show a relationship with quality, a relationship is evident between fixed acidity and pH. As fixed acidity increases, the pH decreases, which makes sense, as a greater acidic level would indicate a lower value on the pH scale.
After adjusting for overplotting, the moderate negative relationship between alcohol and residual sugar becomes apparent. Wines with lower alcohol percentages more commonly have higher sugar content.
Alcohol and residual sugar have a moderate negative linear relationship.
Alcohol content has a stronger correlation with quality than any other variable, with a Pearson correlation coefficient of 0.435574715. Moving into the multivariate analysis section, I will explore the effects of alcohol on quality in conjunction with other variables of interest.
The variables describing the acidity of wine did not display much influence on the quality of wine; however, I did find that fixed acidity has the greatest impact on the pH of the wine.
Density is highly correlated with both alcohol content and residual sugar. As residual sugar increases, the density of wine also increases. As the alcohol content increases, the density of the wine decreases.
Also, I discovered that total sulfur dioxide and density have a moderate positive linear relationship. This was not necessarily something I was searching for in the dataset, but it is interesting to note as I explore density more thoroughly.
The strongest correlation is between density and residual sugar, with a Pearson correlation coefficient of 0.83896645, indicating that as the amount of residual sugar in the wine increases, the density tends to increase as well.
In this plot, pH and fixed acidity maintain their negative relationship. The greater the fixed acidity, the lower the pH. However, I was still unable to identify a relationship between acidity and quality rating as I had initially expected.
Alcohol and residual sugar influence the density of wine, as discovered in the bivariate plots section. Alcohol and density have a negative linear relationship, while residual sugar and density have a positive linear relationship. The plot exhibits a decreasing trend, indicating the negative relationship between alcohol and density, and the higher sugar points fall higher on the plot, indicating the positive relationship between residual sugar and density.
This plot follows a similar structure as the previous plot. Alcohol and total sulfur dioxide have a moderate negative linear relationship, while density and total sulfur dioxide have a moderate positive linear relationship.
I wanted to re-examine the relationship between fixed acidity, volatile acidity and quality of wine by observing all three variables on the same plot. However, the output simply reinforces my previous conclusion of no significant relationship.
Density and total sulfur dioxide display a moderate positive relationship. As the total sulfur dioxide in wine increases, the density of the wine increases as well. This would also indicate that wines with higher total sulfur dioxide likely have a greater amount of residual sugar and lower alcohol percentages. However, these relationships are likely just a result of the relationship of each variable to density; I do not believe total sulfur dioxide would have a direct effect on alcohol or residual sugar content.
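One way to probe the idea that these links run through density is a first-order partial correlation, which estimates the correlation between two variables after removing the linear effect of a third. This is a technique added here for illustration, not part of the original analysis. The sketch uses only coefficients copied from the correlation matrix above; `partial_cor` is a helper defined here, not a library function.

```r
# Pairwise Pearson coefficients copied from the correlation matrix above.
r_so2_density     <-  0.529881324  # total.sulfur.dioxide vs density
r_so2_alcohol     <- -0.448892102  # total.sulfur.dioxide vs alcohol
r_so2_sugar       <-  0.401439311  # total.sulfur.dioxide vs residual.sugar
r_alcohol_density <- -0.78013762   # alcohol vs density
r_sugar_density   <-  0.83896645   # residual.sugar vs density

# First-order partial correlation of x and y, controlling for z.
partial_cor <- function(r_xy, r_xz, r_yz) {
  (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
}

so2_alcohol_given_density <- partial_cor(r_so2_alcohol, r_so2_density, r_alcohol_density)
so2_sugar_given_density   <- partial_cor(r_so2_sugar,   r_so2_density, r_sugar_density)

print(round(c(alcohol = so2_alcohol_given_density,
              residual.sugar = so2_sugar_given_density), 3))
```

Both partial correlations come out below 0.1 in absolute value, far weaker than the raw coefficients of about -0.45 and 0.40. That is in line with the interpretation above, although a partial correlation only accounts for linear effects and does not establish causation.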
As we would expect, this plot does display the positive relationship between total sulfur dioxide and free sulfur dioxide. Wines with higher values of total sulfur dioxide tend to have greater amounts of free sulfur dioxide.
I took a look at the plot between alcohol, quality and density, as I recognized relationships between alcohol and quality, and between alcohol and density. I wanted to explore whether I could view the interactions between the variables in a single plot. Although density and quality do not have a clear relationship, the tendency of lower alcohol wines to have a greater density is apparent, with the darker colored points falling on the lower end of the x-axis.
The plots of alcohol, density and residual sugar are similar in nature across quality ratings. There is a significant point in the 6 rating plot, in which residual sugar is very high, which gives the wine a high density. Otherwise, across the ratings, as residual sugar is lower, the density is lower. Also, wines with higher amounts of residual sugar tend to have lower alcohol percentages.
The main relationships I explored were the ones between residual sugar, alcohol, and quality. Although pH was a feature of interest, it did not necessarily show a relationship with any of the other features of interest from a multivariate standpoint.
The relationship between density, alcohol and residual sugar was very apparent when viewed in a single plot. As previously stated, density and alcohol percentage have a negative relationship, while density and residual sugar have a positive relationship.
I was very surprised that the acidity variables did not have a large influence on quality ratings. When researching wine tasting, I found that “Acidity gives wine its tart and sour taste” (Winefolly). Therefore, I assumed that measures of acidity (pH, citric acid, volatile acidity and fixed acidity) would have a clear effect on the wine quality. However, I was unsuccessful in identifying a relationship.
I did discover a relationship between total sulfur dioxide and density. The two variables have a moderate positive linear relationship. Although this was not necessarily something I was searching for, it is interesting to note.
As discovered in the bivariate plots section, alcohol was the characteristic of wine with the greatest influence on quality. I wanted to develop a model that best depicted this relationship. In the original scatterplot of alcohol content vs. quality, it was very difficult to see the relationship between the two variables. After reducing the noise in the bivariate section by cutting the variable into whole-number percentage intervals (e.g. (9,10]), I wanted to see how well this adjustment fit the original data. To do so, I used the original graph of average quality rating vs. alcohol percentage, and added two different smoothers. The blue smoother was created using the default settings of the geom_smooth() function (method = 'gam' and formula = 'y ~ s(x, bs = "cs")'). The red smoother used the linear method.
The red smoother exhibits an increasing trend throughout, while the blue smoother exhibits an increasing trend only after about 9% alcohol content. The blue smoother has the same initial dip as found in the plot of average quality rating by alcohol content range. If one is interested in predicting the quality of wine solely from alcohol content, this may be the best model for that purpose. Another aspect of the plot made more apparent by the addition of the smoothers is that the greatest variance in average quality occurs between 10 and 12.5%. Further investigation would be necessary to determine the source of this higher variance.
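A plot with these two smoothers could be built roughly as follows. This is a sketch under stated assumptions, not the original plotting code: the data frame `wines` is randomly generated stand-in data, the smoother settings are written out explicitly to match the description above, and the `gam` method relies on the mgcv package being installed.

```r
library(ggplot2)

# Hypothetical stand-in data (NOT the wine data): alcohol in % and quality.
set.seed(1)
wines <- data.frame(alcohol = round(runif(500, min = 8, max = 14.2), 1))
wines$quality <- pmin(9, pmax(3, round(3 + 0.3 * wines$alcohol + rnorm(500))))

# Average quality rating at each distinct alcohol value.
avg <- aggregate(quality ~ alcohol, data = wines, FUN = mean)

p <- ggplot(avg, aes(x = alcohol, y = quality)) +
  geom_point(alpha = 0.5) +
  # Blue: GAM smoother with a cubic regression spline (requires mgcv).
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"),
              colour = "blue", se = FALSE) +
  # Red: straight-line fit.
  geom_smooth(method = "lm", formula = y ~ x, colour = "red", se = FALSE) +
  labs(x = "Alcohol (%)", y = "Average quality rating")

print(p)
```

Stating `method` and `formula` explicitly keeps the plot reproducible, since the method `geom_smooth()` picks by default depends on the number of observations.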
In the bivariate section, I initially examined the plot between density, alcohol percentage and residual sugar, with the points colored according to the amount of residual sugar present in the wine. I wanted to examine the graph with residual sugar plotted against density, colored by alcohol percentage. Both graphs portray essentially the same concepts, yet it is useful to examine the information from both plots.
We can see that alcohol percentage and residual sugar both have a linear correlation with the density of white wine.
For my final plot, I wanted to re-examine the relationship between pH and quality. The plot provided earlier between the cut variable ph_buckets and quality displayed an ideal pH within the range of (3.3, 3.5]. However, the dataset description explains that the dataset is not balanced, and contains more mid-range quality wines, and fewer on the extreme ends of the spectrum (very high or low quality). This makes sense, as the average quality rating for a wine in the pH range of (3.3, 3.5] was just under 6.1. Therefore, I wanted to take a look at the pH distribution for each quality rating, to determine whether this ideal range was consistent across quality ratings, or whether it simply stemmed from the mid-range wines.
While these distributions do show that the mid-range wines have the greatest influence on the values, we can see that as the quality rating increases above the mid-rating, the pH distribution is less dispersed, and falls closer to the (3.3, 3.5] range. For instance, the distribution of the wines rated 7 ranges from about 2.75 to 3.8; for wines with a rating of 8 the distribution ranges from about 2.8 to 3.6; and the wines with a rating of 9 have the least spread in terms of pH, in which these fall approximately between 3.1 and 3.5.
After my initial research on wine tasting and wine attributes, I was surprised by the lack of correlation I discovered in the dataset. There are other variables I would have liked to see in the dataset, such as the price of the wine, to test whether or not higher priced wines are typically rated at higher qualities. Additionally, it would be interesting to see how the wine experts perceived the wine, for instance how accurately they can identify an acidic wine.
Additionally, there are many questions that I have that cannot necessarily be answered solely by this dataset. One pattern I am curious about is the heavy right skew throughout the dataset. I wonder why so many variables had high-value outliers.
After examining the dataset, it can be concluded that alcohol has the greatest impact on the quality of white wine; the higher the alcohol content, the higher the quality rating, on average.
In the future I would like to explore the red wine dataset to determine which variables are most significant in determining the quality of red wine, and whether or not the variables in the dataset show similar patterns to the white wine dataset.
Resources:
Understanding Acidity in Wine. (2015, December 09). Retrieved from https://winefolly.com/review/understanding-acidity-in-wine/
Wine Chemistry, Science Direct. (n.d.). Retrieved from https://www.sciencedirect.com/topics/agricultural-and-biological-sciences/wine-chemistry